Tagged with

Platform engineering

21 May 2026 09:00 AM

Request count works for normal web apps, but it breaks down when you serve LLMs on Kubernetes. Prompt length, output length, RAG context, KV cache pressure, GPU capacity, latency, and observability are all driven by tokens, not requests.

14 May 2026 09:00 AM

A practical introduction to why LLM serving breaks the usual web-app scaling playbook: requests become token streams, latency splits into TTFT and TPOT, replicas may span GPUs or nodes, memory becomes KV cache, and autoscaling needs workload-aware signals instead of CPU alone.

1 May 2026 08:00 AM

Platform engineering isn't rebranded DevOps. Here's what actually changes in your day job, your skills, and your salary when you make the shift, from someone living it.